-
-
Notifications
You must be signed in to change notification settings - Fork 18.4k
REF: simplify CSVFormatter #36046
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REF: simplify CSVFormatter #36046
Conversation
Needed to eliminate compression setter due to the interdependencies between ioargs and compression.
faf1b4b
to
22955db
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks pretty good a few comments
To make sure that the newer mypy (v0.782) passes.
This eliminates repetition of the type annotations for index label in multiple places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks for the update @ivanovmg couple of questions
pandas/_typing.py
Outdated
@@ -82,6 +83,7 @@ | |||
|
|||
Axis = Union[str, int] | |||
Label = Optional[Hashable] | |||
IndexLabel = Optional[Union[bool, str, Sequence[Label]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this just be Optional[Union[Label, Sequence[Label]]]
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As far as I understand, index label can be bool (at least in the original implementation). So, I left it exactly as it was.
It can probably be changed into Optional[Union[bool, Label, Sequence[Label]]]
, if needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool should be included in Hashable though? if that's the case then I don't think you need it here (or is somthing breaking)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Replaced IndexLabel as suggested in b7dae11
pandas/io/formats/csvs.py
Outdated
@@ -75,50 +83,92 @@ def __init__( | |||
self.should_close = ioargs.should_close | |||
self.mode = ioargs.mode | |||
|
|||
# GH21227 internal compression is not used for non-binary handles. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shouldn't this actually be in get_filepath_or_buffer
instead? in pandas/io/common.py
?
cc @twoertwein
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that is currently not in get_filepath_or_buffer
(but it would be a good place for it!). We have the same code also in to_csv
:
pandas/pandas/io/formats/csvs.py
Line 162 in bdb6e26
# GH21227 internal compression is not used for non-binary handles. |
If I didn't miss a function, all callers of
get_filepath_or_buffer
use the returned compression from it (if they care about compression).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved this check on compression method into pandas/io/common.py
. See ca888c1.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you! With this change, I think we don't need the warning in pandas/pandas/io/formats/csvs.py
line 162 anymore.
get_filepath_or_buffer
opens some URL-like strings in binary mode (because pandas supports more operations in binary mode) even if the caller specified a non-binary mode (unless the caller explicitly specified text mode with "t"
). If (isinstance(filepath_or_buffer, str) and is_url(filepath_or_buffer)) or is_fsspec_url(filepath_or_buffer)
, you will need to use fsspec_mode
instead of mode
. I'm not sure how keen the pandas devs are on function decorates, I can imagine that it might be easier to first call get_filepath_or_buffer
(it will decide whether to use binary/text mode) and then throw the warning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@twoertwein, indeed I removed the warning from pandas/io/formats/csvs.py
. All of these checks are not handed to the function get_filepath_or_buffer
.
I am not sure I understood: is there anything else to be done on the topic? Because it seems that everything is fine now: tests pass and there is no check on the compression method in CSVFormatter
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm surprised that it doesn't break this test
pandas/pandas/tests/io/test_gcs.py
Line 78 in c688a0f
def test_to_csv_compression_encoding_gcs(gcs_buffer, compression_only, encoding): |
It should call
get_filepath_or_buffer
with mode='w'
or mode='r'
but get_filepath_or_buffer
will use fsspec_mode
which adds the binary flag.
edit: I just realized why this PR doesn't break this test: fsspecs are strings, they have no write
method.
pandas/io/formats/csvs.py
Outdated
|
||
@property | ||
def quotechar(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you add return type annotions as much as possible on these property / functions
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did that.
pandas/io/formats/csvs.py
Outdated
return self._cols | ||
|
||
@cols.setter | ||
def cols(self, cols): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and type if you can
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added, but with the limitations.
See 1346995
For some reason mypy would not recognize that chunksize turns from Optional[int] to int inside the setter. Even setting an intentional assertion ``assert chunksize is not None`` does not help.
Limitations: - ignore type[assignment] error. - Created additional method _refine_cols to allow conversion from Optional[Sequence[Label]] to Sequence[Label].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
very nice, one small question on typing. pls ping on green.
pandas/_typing.py
Outdated
@@ -82,6 +83,7 @@ | |||
|
|||
Axis = Union[str, int] | |||
Label = Optional[Hashable] | |||
IndexLabel = Optional[Union[bool, str, Sequence[Label]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
bool should be included in Hashable though? if that's the case then I don't think you need it here (or is somthing breaking)?
pandas/io/formats/csvs.py
Outdated
index_label = [index_label] | ||
self._index_label = index_label | ||
|
||
def _get_index_label_from_obj(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ideally type the returns of these
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
sorry @ivanovmg one more set of comments, ping on green.
pandas/io/formats/csvs.py
Outdated
# save it | ||
self.cols = cols | ||
@property | ||
def _number_format(self) -> dict: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
technically this should be Dict[str, Any]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @simonjayhawkins don't we have a mypy setting that fails on this? (or should be have a code-check)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! Corrected typing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @simonjayhawkins don't we have a mypy setting that fails on this? (or should be have a code-check)?
see #30539
pandas/io/formats/csvs.py
Outdated
return bool(self._has_aliases or self.header) | ||
|
||
@property | ||
def write_cols(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you type here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/io/formats/csvs.py
Outdated
if not index: | ||
self.nlevels = 0 | ||
@property | ||
def encoded_labels(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ca you type
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
@jreback, is there anything else to be done on this PR? Or good to go? |
thanks @ivanovmg very nice |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had just started reviewing the code as it's been merged, so this comment is just for info.
@@ -82,6 +83,7 @@ | |||
|
|||
Axis = Union[str, int] | |||
Label = Optional[Hashable] | |||
IndexLabel = Optional[Union[Label, Sequence[Label]]] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the Optional in Label above is to include None in Label. (On hindsight, I think this could have been Union[Hashable, None] for clarity but we don't use the Union[..., None] pattern)
And also I'm not sure how we got the Ordered alias below.
In pandas._typing, I would generally prefer that we don't include the Optional, and just add it when needed in the annotations. I think this in generally allows more use of the aliases (i.e. after setting a default since we don't always set defaults in the signatures)
so in the code, you would have
index_label: Optional[IndexLabel] = None
instead of
index_label: IndexLabel = None
of course Label includes None anyway so the Optional isn't needed anyway. it's just a stylistic preference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@simonjayhawkins, I agree that it is more reasonable to define Label and IndexLabel without Optional. I will take a look at this in a separate PR.
* REF: extract properties cols and has_mi_columns * REF: extract property chunksize * REF: extract property quotechar * REF: extract properties data_index and nlevels * REF: refactor _save_chunk * REF: refactor _save * REF: extract method _save_body * REF: reorder _save-like methods * REF: extract compression property * REF: Extract property index_label * REF: extract helper properties * REF: delete local variables in _save_header * REF: extract method _get_header_rows * REF: move check for header into _save function * TYP: add several type annotations * FIX: fix index labels * FIX: fix multiindex * FIX: fix test failures on compression Needed to eliminate compression setter due to the interdependencies between ioargs and compression. * REF: eliminate preallocation of self.data * REF: extract method _convert_to_native_types * REF: rename regular -> flat as reviewed * TYP: add type annotations as reviewed * REF: refactor number formatting Replace _convert_to_native_types method in favor of a number formatting dictionary. * FIX: mypy error with index_label * FIX: reorder if-statements in index_label To make sure that the newer mypy (v0.782) passes. * TYP: move IndexLabel to pandas._typing This eliminates repetition of the type annotations for index label in multiple places. * TYP: quotechar, has_mi_columns, _need_to_save... * TYP: chunksize, but ignored assignment check For some reason mypy would not recognize that chunksize turns from Optional[int] to int inside the setter. Even setting an intentional assertion ``assert chunksize is not None`` does not help. * TYP: cols property Limitations: - ignore type[assignment] error. - Created additional method _refine_cols to allow conversion from Optional[Sequence[Label]] to Sequence[Label]. * TYP: nlevels and _has_aliases * CLN: move GH21227 check to pandas/io/common.py * TYP: remove redundant bool from IndexLabel type * TYP: add to _get_index_label... methods * TYP: use Iterator instead of Generator * TYP: explicitly use List type * TYP: correct dict typing * TYP: remaining properties
black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff
Refactor CSVFormatter